57 research outputs found

    Tuning-Robust Initialization Methods for Speaker Diarization

    This paper investigates a typical speaker diarization system regarding its robustness against initialization parameter variation and presents a method that significantly reduces the manual tuning of these values. The behavior of an agglomerative hierarchical clustering system is studied to determine which initialization parameters impact accuracy most. We show that the accuracy of typical systems is indeed very sensitive to the values chosen for the initialization parameters and to factors such as the duration of speech in the recording. We then present a solution that reduces the sensitivity to the initialization values and therefore significantly reduces the need for manual tuning, while at the same time increasing the accuracy of the system. For short meetings extracted from the previous (2006, 2007, and 2009) National Institute of Standards and Technology (NIST) Rich Transcription (RT) evaluation data, the diarization error rate decreases by up to 50% relative. The approach consists of a novel initialization parameter estimation method for speaker diarization that uses agglomerative clustering with the Bayesian information criterion (BIC) and Gaussian mixture models (GMMs) of frame-based cepstral features (MFCCs). The estimation method balances the relationship between the optimal value of the seconds of speech data per Gaussian and the duration of the speech data, and is combined with a novel nonuniform initialization method. This approach results in a system that performs better than the current ICSI baseline engine on datasets of the NIST RT evaluations of the years 2006, 2007, and 2009.
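The core estimation idea — scale the Gaussian budget with the amount of speech, then derive the initial cluster count — can be sketched as follows. All function names, default constants, and clamping bounds here are illustrative assumptions, not the paper's actual values:

```python
def estimate_init_params(speech_seconds, target_sec_per_gaussian=2.0,
                         gaussians_per_cluster=5, min_clusters=2, max_clusters=16):
    """Estimate agglomerative-clustering initialization from speech duration.

    Illustrative sketch: the total Gaussian budget is tied to the seconds
    of speech per Gaussian, and the initial cluster count follows from it,
    clamped to a plausible range. Constants are hypothetical.
    """
    total_gaussians = max(1, round(speech_seconds / target_sec_per_gaussian))
    init_clusters = total_gaussians // gaussians_per_cluster
    init_clusters = max(min_clusters, min(max_clusters, init_clusters))
    return init_clusters, gaussians_per_cluster
```

With these (assumed) constants, a 100-second recording yields 10 initial clusters, while very long or very short recordings hit the clamps — which is the sensitivity-to-duration behavior the paper sets out to remove.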

    Robust Speaker Diarization for Short Speech Recordings

    We investigate a state-of-the-art Speaker Diarization system regarding its behavior on meetings that are much shorter (from 500 seconds down to 100 seconds) than those typically analyzed in Speaker Diarization benchmarks. First, the problems inherent to this task are analyzed. Then, we propose an approach that consists of a novel initialization parameter estimation method for typical state-of-the-art diarization approaches. The estimation method balances the relationship between the optimal value of the duration of speech data per Gaussian and the duration of the speech data, which is verified experimentally for the first time in this article. As a result, the Diarization Error Rate for short meetings extracted from the 2006, 2007, and 2009 NIST RT evaluation data is decreased by up to 50% relative.

    Statistical models for HMM/ANN hybrids

    We present a theoretical investigation into the use of normalised artificial neural network (ANN) outputs in the context of hidden Markov models (HMMs). The work is motivated by the pursuit of a more theoretically rigorous understanding of the Kullback-Leibler (KL)-HMM. Two possible models are considered, based respectively on the HMM states storing categorical distributions and Dirichlet distributions. Training and recognition algorithms are derived, and possible relationships with KL-HMM are briefly discussed.
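In the categorical-distribution variant, each HMM state stores a categorical distribution over the ANN's output classes, and the local match between a state and a frame is a KL divergence. A minimal sketch of such a local score (variable names and the divergence direction shown are illustrative; the paper discusses the modeling choices in detail):

```python
import numpy as np

def kl_local_score(state_categorical, ann_posterior, eps=1e-10):
    """KL(state || posterior) as a local frame score for a KL-HMM-style
    decoder. Sketch only: clipping by eps guards against log(0), and the
    direction of the divergence is one of several possible choices.
    """
    y = np.clip(state_categorical, eps, 1.0)
    z = np.clip(ann_posterior, eps, 1.0)
    return float(np.sum(y * np.log(y / z)))
```

The score is zero when the ANN posterior matches the state's stored distribution exactly and grows as they diverge, so Viterbi decoding can use it as a (negated) emission log-likelihood substitute.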

    Leveraging speaker diarization for meeting recognition from distant microphones

    We investigate using state-of-the-art speaker diarization output for speech recognition purposes. While it seems obvious that speech recognition could benefit from the output of speaker diarization ("Who spoke when") for effective feature normalization and model adaptation, such benefits have remained elusive in the very challenging domain of meeting recognition from distant microphones. In this study, we show that recognition gains are possible by careful postprocessing of the diarization output. Still, recognition accuracy may suffer when the underlying diarization system performs worse than expected, even compared to far less sophisticated speaker-clustering techniques. We obtain a more accurate and robust overall system by combining recognition output with multiple speaker segmentations and clusterings. We evaluate our methods on data from the 2009 NIST Rich Transcription meeting recognition evaluation.

    Using KL-divergence and multilingual information to improve ASR for under-resourced languages

    Setting out from the point of view that automatic speech recognition (ASR) ought to benefit from data in languages other than the target language, we propose a novel Kullback-Leibler (KL) divergence based method that is able to exploit multilingual information in the form of universal phoneme posterior probabilities conditioned on the acoustics. We formulate a means to train a recognizer on several different languages, and subsequently recognize speech in a target language for which only a small amount of data is available. Taking the Greek SpeechDat(II) data as an example, we show that the proposed formulation is sound, and show that it is able to outperform a current state-of-the-art HMM/GMM system. We also use a hybrid Tandem-like system to further understand the source of the benefit.
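One way to picture how little target-language data can suffice in this setting: a target-language state's categorical distribution can be estimated from the universal phoneme posteriors of the frames aligned to it. The sketch below uses the arithmetic mean, which is the closed-form minimizer of the reverse KL divergence over the aligned frames; the paper's actual training criterion and divergence direction may differ, so treat this as a schematic illustration only:

```python
import numpy as np

def estimate_state_distribution(posteriors):
    """Estimate a target-language state's categorical distribution from
    the universal phoneme posteriors of its aligned frames.

    Sketch: the arithmetic mean minimizes sum_t KL(z_t || y) over y for
    categorical z_t; renormalization guards against numerical drift.
    """
    p = np.asarray(posteriors, dtype=float)
    mean = p.mean(axis=0)
    return mean / mean.sum()
```

Because each state needs only enough frames to estimate one categorical distribution (rather than a full GMM), the approach is well suited to under-resourced target languages.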

    Hierarchical Multilayer Perceptron based Language Identification

    Automatic language identification (LID) systems generally exploit acoustic knowledge, possibly enriched by explicit language-specific phonotactic or lexical constraints. This paper investigates a new LID approach based on hierarchical multilayer perceptron (MLP) classifiers, where the first layer is a "universal phoneme set MLP classifier". The resulting (multilingual) phoneme posterior sequence is fed into a second MLP taking a larger temporal context into account. The second MLP can implicitly learn and exploit different types of patterns and information, such as confusion between phonemes and/or phonotactics, for LID. We investigate the viability of the proposed approach by comparing it against two standard approaches which use phonotactic and lexical constraints with the universal phoneme set MLP classifier as emission probability estimator. On SpeechDat(II) datasets of five European languages, the proposed approach yields significantly better performance compared to the two standard approaches.
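The "larger temporal context" for the second MLP is typically obtained by stacking a window of first-stage posterior vectors into one input vector per frame. A minimal sketch of that stacking step (the window size and edge padding strategy are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def stack_posterior_context(posteriors, context=4):
    """Build second-stage MLP inputs by stacking +/-context frames of
    first-stage phoneme posteriors; edge frames are padded by repeating
    the boundary frame. Input shape (T, D) -> output (T, (2*context+1)*D).
    """
    T, D = posteriors.shape
    padded = np.pad(posteriors, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])
```

Feeding such stacked posteriors to the second MLP is what lets it pick up phonotactic-like regularities spanning several phonemes without any explicit phonotactic model.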

    Robust triphone mapping for acoustic modeling

    In this paper we revisit the recently proposed triphone mapping as an alternative to decision tree state clustering. We generalize triphone mapping to Kullback-Leibler based hidden Markov models for acoustic modeling and propose a modified training procedure for Gaussian mixture model based acoustic modeling. We compare triphone mapping to decision tree state clustering on the Wall Street Journal task as well as in the context of an under-resourced language, using Greek data from the SpeechDat(II) corpus. Experiments reveal that triphone mapping has the best overall performance and is robust against varying the acoustic modeling technique as well as variable amounts of training data.
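The problem triphone mapping addresses is that triphones unseen in training still need a model at recognition time. The sketch below shows only the general backoff idea with a simple name-based scheme; the paper's mapping is data-driven (based on similarity between triphone models), so this is a schematic stand-in, not the proposed algorithm:

```python
def map_triphone(triphone, seen):
    """Map a triphone 'l-c+r' that was unseen in training to a seen unit
    by progressively dropping context, backing off to the center
    monophone. Schematic illustration only; the paper uses a data-driven
    mapping between triphone models instead of this name-based backoff.
    """
    if triphone in seen:
        return triphone
    left, rest = triphone.split("-")
    center, right = rest.split("+")
    for candidate in (f"{left}-{center}", f"{center}+{right}", center):
        if candidate in seen:
            return candidate
    return center
```

Unlike decision tree state clustering, which ties states via phonetic questions, a mapping of this kind assigns every unseen triphone to an existing model directly, which is what makes it easy to carry over to KL-HMM acoustic models.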